Objective and Background Information
The purpose of this project is to examine what characteristics/variables make a given song “popular”, and to create an algorithm that can predict whether a song will be popular or not. This algorithm could be used in many ways, such as determining which songs an artist should release if the primary goal is to release a popular song—this could be very useful for record labels trying to release hit songs.
Dataset
The dataset used comprises around 170,000 Spotify songs from 1921–2020 and was last updated on 11/25/2020 (11 days before this writing). This article does a great job of explaining the various “audio features” that Spotify links to a song. Here are the first few songs in the dataset as a reference:
| valence | year | acousticness | artists | danceability | duration_ms | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0594 | 1921 | 0.982 | [‘Sergei Rachmaninoff’, ‘James Levine’, ‘Berliner Philharmoniker’] | 0.279 | 831667 | 0.211 | 0 | 4BJqT0PrAfrxzMOxytFOIz | 8.78e-01 | 10 | 0.665 | -20.096 | 1 | Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve | 4 | 1921 | 0.0366 | 80.954 |
| 0.9630 | 1921 | 0.732 | [‘Dennis Day’] | 0.819 | 180533 | 0.341 | 0 | 7xPhfUan2yNtyFG0cUWkt8 | 0.00e+00 | 7 | 0.160 | -12.441 | 1 | Clancy Lowered the Boom | 5 | 1921 | 0.4150 | 60.936 |
| 0.0394 | 1921 | 0.961 | [‘KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat’] | 0.328 | 500062 | 0.166 | 0 | 1o6I8BglA6ylDMrIELygv1 | 9.13e-01 | 3 | 0.101 | -14.850 | 1 | Gati Bali | 5 | 1921 | 0.0339 | 110.339 |
| 0.1650 | 1921 | 0.967 | [‘Frank Parker’] | 0.275 | 210000 | 0.309 | 0 | 3ftBPsC5vPBKxYSee08FDH | 2.77e-05 | 5 | 0.381 | -9.316 | 1 | Danny Boy | 3 | 1921 | 0.0354 | 100.109 |
| 0.2530 | 1921 | 0.957 | [‘Phil Regan’] | 0.418 | 166693 | 0.193 | 0 | 4d6HGyGT8e121BsdKmw9v6 | 1.70e-06 | 3 | 0.229 | -10.096 | 1 | When Irish Eyes Are Smiling | 2 | 1921 | 0.0380 | 101.665 |
| 0.1960 | 1921 | 0.579 | [‘KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat’] | 0.697 | 395076 | 0.346 | 0 | 4pyw9DVHGStUre4J6hPngr | 1.68e-01 | 2 | 0.130 | -12.506 | 1 | Gati Mardika | 6 | 1921 | 0.0700 | 119.824 |
Our target variable is the “popularity” variable, but because it ranges from 0 to 100 (as seen below), we’ll convert it to a binary variable: a song with popularity >= 50 is “popular”, and one below 50 is “not popular”.
Of course, there is no set cutoff for what makes a song “popular”; however, about 20% of the songs have a popularity >= 50, so that appears to be a fair threshold. Note that the popularity variable is based on the number of recent plays of a given song, so more recent songs will typically score higher. Our analysis will therefore reveal what qualities make a song popular now, which is of far more use to a record label or artist than popularity data from the past, as trends in music change every year.
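In R, this binarization can be sketched as follows (the data frame name `spotify` is an assumption; the column names follow the table above):

```r
# Hypothetical frame name: label songs with popularity >= 50 as "popular"
spotify$is_popular <- factor(ifelse(spotify$popularity >= 50,
                                    "popular", "not popular"))

# Share of songs clearing the threshold (about 20% per the discussion above)
mean(spotify$is_popular == "popular")
```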
Questions
What factors contribute to making a song popular?
Do songs with explicit content share other characteristics? (This could be useful for families and parents.)
Do certain factors make a song higher or lower in valence (>= 0.6 happy, 0.4 < neutral < 0.6, <= 0.4 sad/angry)?
We suspect popularity has much to do with energy and danceability; with the current wave of TikTok and similar platforms, it may not take much for a song to become popular: it just needs to be discovered and have the right qualities.
Purpose for Exploration
Looking at popularity is a good way to gauge current cultural trends in the US.
It is helpful when composing music to see which factors play heavily into popularity.
A reference for comparison can be found here; it conducts a similar analysis of popularity on Spotify songs.
Methods being Used
We decided to use random forests and XGBoost because of their interpretability, minimal data-preparation requirements, and ability to handle both numerical and categorical data.
What Factors Contribute to the Popularity of a Song at the End of 2020?
The first analysis we’ll conduct is building a random forest, and eventually a boosted tree, to try to predict whether a song will be popular and which factors most impact popularity. We’ll also filter the data to include only songs from the 1970s or later, as those are the songs relevant to the analysis.
Cleaning the Data
First some data cleaning is necessary before building the models. The primary tasks are factoring/refactoring certain variables and creating thresholds to make binary or factorable variables.
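A hedged sketch of that cleaning, with the frame names `spotify` and `spotify2` assumed (the exact steps of the original code are not shown):

```r
# Keep songs from 1970 onward, as described later in the analysis
spotify2 <- subset(spotify, year >= 1970)

# Factor the categorical variables
spotify2$explicit <- factor(spotify2$explicit, levels = c(0, 1),
                            labels = c("no explicit", "explicit"))
spotify2$mode <- factor(spotify2$mode)
spotify2$key  <- factor(spotify2$key)

# Bucket years into decades for the summaries that follow
spotify2$decade <- factor(ifelse(spotify2$year == 2020, "2020",
                                 paste0(floor(spotify2$year / 10) * 10, "s")))
```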
Exploratory Analysis
Summary Statistics
Below are some relevant statistics regarding the popularity target variable:
Therefore, the base rate of the data is 36.90%. We can also view the breakdown of how many songs fall into each decade; the majority are from before 2020, which makes sense since each full decade contributes ten years of songs while 2020 contributes just one. Looking at the percentage of songs that were popular in each decade, there is, as one might expect, a steady increase in popularity over time, with about 85% of the songs from 2020 being popular.
Data Visualizations
Hypothesis on Important Variables
Based on the above visualizations, we can initially hypothesize that songs that are easier to dance to, have more energy, and are in a major scale will be more popular. It is interesting that the spread of valence for popular vs. non-popular songs is similar, as we would have expected happier songs (higher valence) to be more popular. Based on the analysis above, we would also expect decade to be very important because more recent songs are more popular; however, decade is unnecessary in our analysis because it is a rather obvious correlation and won’t give us much insight. Now we’ll build our models.
Random Forest
Testing and Training Data
The dataset will be split 90/10 training and testing.
Mtry level
The Mtry level is the number of variables randomly sampled as candidates at each split. The default number for classification is sqrt(# of variables).
Our mtry level comes out to 3.61, which we’ll round to 4.
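The split and the default mtry might be computed like this (a sketch; the `spotify2` frame name and the seed are assumptions):

```r
set.seed(1)
n <- nrow(spotify2)
train_idx <- sample(n, size = floor(0.9 * n))
popular_train <- spotify2[train_idx, ]
popular_test  <- spotify2[-train_idx, ]

# Default mtry for classification: sqrt(number of predictor variables)
sqrt(13)  # 13 predictors -> 3.61, rounded to 4
```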
Random Forest - 500 Trees
Initially, we will generate a random forest of 500 trees with an mtry of 4. To ensure that the trees are not all identical and have the opportunity to specialize in different subsets of the data, we set the replace argument to TRUE so each tree is grown on a bootstrap sample.
Model output:
##
## Call:
## randomForest(formula = is_popular ~ ., data = popular_train, ntree = 500, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 29.36%
## Confusion matrix:
## not popular popular class.error
## not popular 51186 6267 0.1090805
## popular 20477 13151 0.6089271
Initially we have an out-of-bag error of 29.36%, which is not great but reasonable for a first try; with some tuning of the model we can bring this down. Also, based on the initial confusion matrix, the model is far better at correctly classifying unpopular songs than popular ones, which makes sense given that the dataset is rich in unpopular instances.
Evaluating Model
Accuracy
The accuracy of our initial model is 70.64%, which is fair. Accuracy alone is not a great metric, however, because it can be skewed by the model’s imbalanced performance on true positives versus true negatives. Let’s dig deeper.
Train vs Predict
If we look at the training-set frequencies for popular vs. unpopular, about 37% of the songs are popular. However, only about 27% of the random forest’s predictions were “popular.” Therefore, we can assess that the model errs on the side of classifying a song as unpopular.
Variable Importance
We can see that the variables of loudness, danceability, valence, and key are the most important.
Error Visualization
In the above table we can view the out of bag error rate for each individual tree, as well as the difference between the popular and unpopular error rates.
We can also create a plot that expresses each error component visually:
The error terms gradually flatten out after about 200 trees, so we can conclude that a forest of 500 trees is rather excessive.
Confusion Matrix
## not popular popular class.error
## not popular 51186 6267 0.1090805
## popular 20477 13151 0.6089271
As mentioned above, the model is far superior at classifying unpopular songs and errs on the side of predicting “unpopular,” which is evident in the class errors above (about 11% for unpopular songs versus 61% for popular ones). The sensitivity is 39.11%, again telling us the classifier is poor at predicting popular songs.
Error Table
131 Trees, Lowest Popular Error
Next we’ll build another forest with 131 trees, which was the number of trees with the lowest error for the popular class.
Model output:
##
## Call:
## randomForest(formula = is_popular ~ ., data = popular_train, ntree = 131, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 131
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 29.55%
## Confusion matrix:
## not popular popular class.error
## not popular 50732 6721 0.1169826
## popular 20190 13438 0.6003925
Accuracy of 70.45%, virtually the same as the 500 tree forest.
Comparing Random Forests
The 131-tree model makes slightly more “popular” predictions than the 500-tree model; however, there is little difference between the two.
Below both variable importance plots are displayed, the first for the 500 tree model, the second for the 131 tree model.
There is little difference between the variable importance plots, with loudness, key, danceability, and valence appearing to be the most important variables.
131 Forest Error Visualization
Confusion Matrices
A confusion matrix for the 500 trees is displayed first, then one for 131 trees.
## not popular popular class.error
## not popular 51186 6267 0.1090805
## popular 20477 13151 0.6089271
## not popular popular class.error
## not popular 50732 6721 0.1169826
## popular 20190 13438 0.6003925
The confusion matrices for both forests are very similar, telling us that limiting the number of trees in our forest doesn’t do much; therefore, let’s try tuning the model to see if that improves our sensitivity.
Predictions on Test Data
131 Trees
First we use the predict function in R and add the predictions to our test set so that we can use the confusion matrix function. The output of the matrix is below:
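That step might look like the following, assuming the caret package supplies the confusion matrix (the output format suggests it) and the forest object is named `popular_rf_131` (both names are assumptions):

```r
library(caret)
library(randomForest)

popular_test$pred <- predict(popular_rf_131, newdata = popular_test)

confusionMatrix(data = popular_test$pred,
                reference = popular_test$is_popular,
                positive = "popular", mode = "everything",
                dnn = c("Prediction", "Actual"))
```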
## Confusion Matrix and Statistics
##
## Actual
## Prediction not popular popular
## not popular 5590 2231
## popular 812 1487
##
## Accuracy : 0.6993
## 95% CI : (0.6903, 0.7082)
## No Information Rate : 0.6326
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2969
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3999
## Specificity : 0.8732
## Pos Pred Value : 0.6468
## Neg Pred Value : 0.7147
## Precision : 0.6468
## Recall : 0.3999
## F1 : 0.4943
## Prevalence : 0.3674
## Detection Rate : 0.1469
## Detection Prevalence : 0.2272
## Balanced Accuracy : 0.6366
##
## 'Positive' Class : popular
##
As mentioned above, the sensitivity is very poor (40%) and our F1 score (a measure of how well the model predicts the positive class) is also low (0.49). Let’s use the tuneRF function to try to improve the model.
Tuning Model
Using the tuneRF function, we now check for the optimal number of variables to consider at each split during the tree-building process.
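A call along these lines would produce the table below; the starting value and step factor here are assumptions:

```r
library(randomForest)

set.seed(1)
predictors <- popular_train[, setdiff(names(popular_train), "is_popular")]
tuneRF(x = predictors, y = popular_train$is_popular,
       mtryStart = 5,      # assumed starting point
       ntreeTry = 131, stepFactor = 2, improve = 0.001)
```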
## mtry OOBError
## 3.OOB 3 0.2818810
## 5.OOB 5 0.2806184
## 10.OOB 10 0.2832424
After running the function, the mtry with the lowest out-of-bag error rate is 5, so we’ll fit a model with that parameter.
Random Forest - 131 trees, mtry = 5
Model output:
##
## Call:
## randomForest(formula = is_popular ~ ., data = popular_train, ntree = 131, mtry = 5, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 131
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 29.8%
## Confusion matrix:
## not popular popular class.error
## not popular 51012 6441 0.1121090
## popular 20699 12929 0.6155287
Even with the parameter change, the model still doesn’t perform any better (out-of-bag error rate of 29.8%, slightly worse than before).
ROC for 131 trees, 4 mtry
As a final metric, we can assess the ROC curve and calculate an AUC (area under curve) which is an expression of the balance of the sensitivity and specificity.
Our AUC comes out to 0.72, which is fair but not great; it is held down by the low sensitivity discovered previously.
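The curve and AUC can be produced with the pROC package; the object names here are assumptions:

```r
library(pROC)

# Class probabilities rather than hard labels
rf_probs <- predict(popular_rf_131, newdata = popular_test, type = "prob")

roc_obj <- roc(response = popular_test$is_popular,
               predictor = rf_probs[, "popular"], plot = TRUE)
auc(roc_obj)  # reported above as 0.72
```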
XGBoost Model
Finally, let’s see if boosting our tree will aid the model’s ability to predict the positive class. We’ll use the xgboost (extreme gradient boosting) package, which fits each successive tree to the residual errors of the current ensemble, gradually improving the model. For the model parameters, we chose a max depth of 6 (to limit overfitting), an eta of 0.1 (a fairly conservative learning rate), and 400 “passes” through the data.
Some data preparation is necessary, but it basically just involves one-hot encoding the dataset so it can be passed to the xgboost function.
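A sketch of that preparation and the training call (the parameters match the printed call, but the object names are assumptions):

```r
library(xgboost)

# One-hot encode the predictors; -1 drops the intercept column
train_mm <- model.matrix(is_popular ~ . - 1, data = popular_train)
dtrain <- xgb.DMatrix(data = train_mm,
                      label = as.numeric(popular_train$is_popular) - 1)

boost_model <- xgb.train(data = dtrain, nrounds = 400,
                         max.depth = 6, eta = 0.1,
                         objective = "binary:logistic",
                         watchlist = list(train = dtrain))
```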
Model Output
Below is the raw output of the boosted model where one can view the parameters and features of the tree.
## ##### xgb.Booster
## raw: 1.2 Mb
## call:
## xgb.train(params = params, data = dtrain, nrounds = nrounds,
## watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
## early_stopping_rounds = early_stopping_rounds, maximize = maximize,
## save_period = save_period, save_name = save_name, xgb_model = xgb_model,
## callbacks = callbacks, max.depth = 6, eta = 0.1, objective = "binary:logistic")
## params (as set within xgb.train):
## max_depth = "6", eta = "0.1", objective = "binary:logistic", validate_parameters = "TRUE"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## cb.evaluation.log()
## # of features: 26
## niter: 400
## nfeatures : 26
## evaluation_log:
## iter train_error
## 1 0.293585
## 2 0.291268
## ---
## 399 0.215270
## 400 0.215105
Error Plot
Below is a plot showing the decrease in training-set error as the iterations (400) progress. We can see how quickly the model learns; the error drops fast, an advantage of the xgboost package.
Variable Importance
We can also pull the variable importance to see whether the boosted model selects similar important variables to the random forests.
The xgboost model has similar important variables (loudness, valence, danceability); however, it is interesting that the duration of the song ranks so high.
Confusion Matrix
We can also pull a confusion matrix from the model to compare it with the past random forest models.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 5427 1820
## 1 975 1898
##
## Accuracy : 0.7238
## 95% CI : (0.715, 0.7325)
## No Information Rate : 0.6326
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3761
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5105
## Specificity : 0.8477
## Pos Pred Value : 0.6606
## Neg Pred Value : 0.7489
## Precision : 0.6606
## Recall : 0.5105
## F1 : 0.5759
## Prevalence : 0.3674
## Detection Rate : 0.1875
## Detection Prevalence : 0.2839
## Balanced Accuracy : 0.6791
##
## 'Positive' Class : 1
##
The boosted tree performs much better than the random forests, as expected given the more complex nature of xgboost. Below is a comparison of the xgboost model metrics to those of the 131-tree model (131-tree metrics in parentheses):
Accuracy = 72.38% (69.93%)
Kappa = 0.38 (0.30)
Sensitivity = 51.05% (39.99%)
Specificity = 84.77% (87.32%)
F1 = 0.58 (0.49)
Balanced Accuracy = 67.91% (63.66%)
The primary metric we cared about improving was the sensitivity (how well we predict popular songs), and the xgboost model improved it by about 11 percentage points to 51%. While still not excellent, this is much better than the random forest models. Similarly, the F1 score is 0.58 as opposed to 0.49 for the 131-tree random forest, indicating the boosted model is better at classifying the positive class. However, the specificity for the boosted model is about 3.5 points lower than the random forest’s, indicating a higher false positive rate; that is to be expected, since the model classifies more cases as positive. Also, when using xgboost we have to be cautious of overfitting, which is part of why we limited the max depth to 6.
Conclusions
Overall, after building the various random forest and xgboost models, the xgboost model performed the best. While the sensitivity, the rate at which the model correctly classifies the positive class (a popular song), is fairly low, the model still performs decently. Because the nature of the classifier is rather complex and subjective (there are no defined variables or method for determining whether a song is “popular”), we didn’t expect the models to have excellent prediction metrics. We would recommend that a record label or artist use the xgboost model only to get a sense of how popular a song might be, and not put too much trust in it.
Do Songs with Explicit Content Generally Sound the Same?
For the second part of our project, we hope to explore just the songs labeled as having explicit content. We will consider several metrics given in the dataset to judge whether or not these songs “sound” similar. These factors will include energy, valence, danceability, popularity, and “speechiness”.
Exploratory Analysis
We will first look at the songs classified as explicit (using spotify2$explicit to pull the data), and collect some basic summary statistics about the set.
Summary Statistics
We see that the base rate of the data is 11.77%. From this table, we can see that nearly half of explicit songs (49%) come from the most recent period, with the share decreasing progressively as we go further into the past. Now, let’s look at some data visualizations comparing the explicit songs to the factors listed in the aforementioned section.
Data Visualizations
The first box plot compares non-explicit and explicit songs on the danceability rating provided by Spotify. Explicit songs seem to have a slight edge in this metric, although there are far fewer songs in that category. Likewise, explicit songs have a slightly higher mean on the “energy” metric, while non-explicit songs have higher valence. Explicit songs also score almost 10 points higher in popularity on average than non-explicit music (around 55 versus 45 for clean), and have a much higher speechiness value.
In terms of key selection, the distributions differ somewhat between the two categories, despite the vast difference in sample size. Explicit songs tend to favor the C# key, while clean music concentrates more in the C, D, G, and A keys.
The last bar graph illustrates the rising proportion of explicit songs in the dataset as time progresses; it is a pictorial representation of the table generated above.
Selecting Important Variables
Based on these graphs, we would hypothesize that key selection is an important factor in distinguishing explicit from non-explicit songs: the C# key is far more prevalent, in proportion to other keys, in explicit music, while C, D, G, and A appear more in clean music. Speechiness should also be important, given the large difference in the box plots.
Decision Tree
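The tree below was presumably grown with rpart; a minimal sketch, assuming `explicit_dataset` holds only the response and the predictors listed in the importance table:

```r
library(rpart)

explicit_tree <- rpart(explicit ~ ., data = explicit_dataset,
                       method = "class",
                       control = rpart.control(cp = 0.01))

explicit_tree                      # node/split listing
explicit_tree$variable.importance  # importance metrics
```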
Model output:
## n= 101201
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 101201 11899 no explicit (0.88242211 0.11757789)
## 2) speechiness< 0.1435 88366 5385 no explicit (0.93906027 0.06093973) *
## 3) speechiness>=0.1435 12835 6321 explicit (0.49248150 0.50751850)
## 6) decade=1970s,1980s 2269 196 no explicit (0.91361833 0.08638167) *
## 7) decade=1990s,2000s,2010s,2020 10566 4248 explicit (0.40204429 0.59795571)
## 14) speechiness< 0.2345 4421 2078 no explicit (0.52997059 0.47002941)
## 28) danceability< 0.6665 1912 655 no explicit (0.65742678 0.34257322) *
## 29) danceability>=0.6665 2509 1086 explicit (0.43284177 0.56715823) *
## 15) speechiness>=0.2345 6145 1905 explicit (0.31000814 0.68999186) *
Variable Importance:
## speechiness decade popularity danceability
## 4735.762749 977.678855 348.586247 130.664930
## energy loudness acousticness liveness
## 57.843841 43.005347 41.511290 33.609057
## tempo valence instrumentalness duration_ms
## 23.014297 23.014297 16.489445 6.563738
From these variable-importance metrics, we can see that speechiness, decade, popularity, and danceability are the four most important variables for predicting whether a song has explicit content. Our initial hypothesis was partially correct. We would guess that key does not appear because, at this dataset’s size, there is still a massive difference between non-explicit and explicit songs regardless of the specific key.
Plotted Tree
The decision tree first uses speechiness to differentiate between the two categories. If the speechiness is less than 0.14, the song is classified as non-explicit; otherwise, the tree moves to the second split, which uses decade. If the song’s decade is the 1970s or 1980s, it is classified as non-explicit; otherwise the model proceeds to the third split, which again uses speechiness. If the speechiness is above 0.23, the song is classified as explicit; otherwise the model proceeds to its fourth and final split, on danceability: below 0.67 the song is non-explicit, and at or above 0.67 it is classified as explicit.
Plotted CP
Based on the above plot, it appears that four splits is the ideal amount for this model. Our tree above does indeed have four splits.
CP Table
| CP | nsplit | rel error | xerror | xstd | opt |
|---|---|---|---|---|---|
| 0.0869821 | 0 | 1.0000000 | 1.0000000 | 0.0086116 | 1.0086116 |
| 0.0252962 | 2 | 0.8260358 | 0.8260358 | 0.0079170 | 0.8339528 |
| 0.0100000 | 4 | 0.7754433 | 0.7796453 | 0.0077146 | 0.7831579 |
The CP table confirms the above result: four splits reduce the relative error by a greater margin than zero or two splits do.
Evaluating Model
Predicting Values
Now we will try to determine the optimal model for predicting explicit songs. We will produce fitted values using type = “class” and then compare the actual labels to the predicted ones to determine accuracy.
Actual Split
Predicted Split
Confusion Matrix
## Confusion Matrix and Statistics
##
## Actual
## Prediction no explicit explicit
## no explicit 86311 6236
## explicit 2991 5663
##
## Accuracy : 0.9088
## 95% CI : (0.907, 0.9106)
## No Information Rate : 0.8824
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5017
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.47592
## Specificity : 0.96651
## Pos Pred Value : 0.65438
## Neg Pred Value : 0.93262
## Prevalence : 0.11758
## Detection Rate : 0.05596
## Detection Prevalence : 0.08551
## Balanced Accuracy : 0.72121
##
## 'Positive' Class : explicit
##
The confusion matrix above tells us that the overall accuracy of our model is 90.88%, which is pretty good, especially for a dataset of this magnitude. The F1 score, which combines precision (positive predictive value) and recall (sensitivity), is 0.55, also a decent result. The detection rate, the rate at which the algorithm detects the positive class in proportion to all classifications [A/(A+B+C+D), where A is true positives], is 0.056.
ROC
##
## Call:
## roc.default(response = explicit_dataset$explicit, predictor = as.numeric(explicit_fitted_model), plot = TRUE)
##
## Data: as.numeric(explicit_fitted_model) in 89302 controls (explicit_dataset$explicit no explicit) < 11899 cases (explicit_dataset$explicit explicit).
## Area under the curve: 0.7212
The ROC curve gives another view of our model’s performance. As the calculation above shows, the area under the curve (AUC) for our fitted model is 0.7212, which is decent but could definitely be improved. Let’s change the thresholds and run a random forest to see if we can create a better model.
Changing thresholds
## Confusion Matrix and Statistics
##
## Actual
## Prediction no explicit explicit
## no explicit 89302 11899
## explicit 0 0
##
## Accuracy : 0.8824
## 95% CI : (0.8804, 0.8844)
## No Information Rate : 0.8824
## P-Value [Acc > NIR] : 0.5024
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8824
## Precision : NA
## Recall : 0.0000
## F1 : NA
## Prevalence : 0.1176
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : explicit
##
| threshold | accuracy | tpr | fpr | kappa | f1 |
|---|---|---|---|---|---|
| 0.2 | 0.9029 | 0.53097 | 0.04757 | 0.5081 | 0.56247 |
| 0.4 | 0.9088 | 0.47592 | 0.03349 | 0.5017 | 0.55106 |
| 0.6 | 0.9055 | 0.35633 | 0.02133 | 0.4238 | 0.46996 |
It appears that 0.2 is the ideal threshold for this model, as it has the highest TPR and F1 values. Now we can set up our random forest.
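One way to produce the threshold table above, sketched with assumed object names:

```r
library(caret)

# Class-probability predictions from the fitted tree
probs <- predict(explicit_tree, type = "prob")[, "explicit"]

threshold_metrics <- function(th) {
  pred <- factor(ifelse(probs >= th, "explicit", "no explicit"),
                 levels = c("no explicit", "explicit"))
  cm <- confusionMatrix(pred, explicit_dataset$explicit,
                        positive = "explicit", mode = "everything")
  c(threshold = th,
    accuracy  = unname(cm$overall["Accuracy"]),
    tpr       = unname(cm$byClass["Sensitivity"]),
    fpr       = 1 - unname(cm$byClass["Specificity"]),
    kappa     = unname(cm$overall["Kappa"]),
    f1        = unname(cm$byClass["F1"]))
}

t(sapply(c(0.2, 0.4, 0.6), threshold_metrics))
```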
Random Forest
Testing and Training Data
The dataset will be split 90/10 training and testing.
Mtry level
The Mtry level is the number of variables randomly sampled as candidates at each split. The default number for classification is sqrt(# of variables).
The mtry comes out to 3.6, which we’ll round to 4.
Random Forest - 500 Trees
Initially, we will be generating a random forest made up of 500 trees, and an mtry of 4. In order to ensure that these trees are not all identical and have the opportunity to specialize in different subsets of the data, we will set the argument of replace to TRUE.
Model Output:
##
## Call:
## randomForest(formula = explicit ~ ., data = explicit_train, ntree = 500, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 9.41%
## Confusion matrix:
## no explicit explicit class.error
## no explicit 79733 647 0.008049266
## explicit 7927 2774 0.740771890
Evaluating Model
Accuracy
The overall accuracy of our 500 tree model is about 90.5%, which is pretty good but not much better than our initial model.
Actual and Predicted
Important Variables
| no explicit | explicit | MeanDecreaseAccuracy | MeanDecreaseGini | |
|---|---|---|---|---|
| valence | 0.0024474 | 0.0014454 | 0.0023297 | 0.8000776 |
| acousticness | 0.0014462 | 0.0063287 | 0.0020197 | 0.8326905 |
| danceability | 0.0047170 | 0.0294600 | 0.0076238 | 1.8215381 |
| duration_ms | 0.0004881 | 0.0009567 | 0.0005431 | 0.7652550 |
| energy | 0.0025200 | 0.0023346 | 0.0024981 | 0.8460576 |
| instrumentalness | 0.0004481 | 0.0179402 | 0.0025030 | 0.7458689 |
| key | 0.0002385 | 0.0066839 | 0.0009957 | 2.0894633 |
| liveness | 0.0004334 | 0.0004604 | 0.0004366 | 0.7669011 |
| loudness | 0.0025724 | 0.0130235 | 0.0038002 | 0.9867742 |
| mode | -0.0000509 | 0.0009839 | 0.0000707 | 0.1285905 |
| popularity | 0.0008627 | 0.0253006 | 0.0037336 | 1.5071601 |
| speechiness | 0.0136789 | 0.1365978 | 0.0281193 | 4.7354641 |
| tempo | 0.0011692 | 0.0012078 | 0.0011738 | 0.7863611 |
| decade | 0.0011948 | 0.0276167 | 0.0042987 | 1.0702378 |
Data Visualization
Confusion Matrix
## no explicit explicit class.error
## no explicit 79733 647 0.008049266
## explicit 7927 2774 0.740771890
Error Tables
Looking at this error table, a few interesting values pop out immediately, most notably 248 trees, which has among the lowest out-of-bag (OOB) and explicit-class error rates. Thus we will pick this value for our next forest.
248 Trees, Lowest OOB and Explicit-Class Error
Model Output:
##
## Call:
## randomForest(formula = explicit ~ ., data = explicit_train, ntree = 248, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 248
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 9.2%
## Confusion matrix:
## no explicit explicit class.error
## no explicit 79611 769 0.009567056
## explicit 7615 3086 0.711615737
Our accuracy metric shows that this forest has an overall accuracy of 90.79%, only marginally higher than that of the 500-tree model. Let’s compare the two models further.
Comparing Random Forests
Both variable importance plots are displayed, the first for the 500 tree model, the second for the 248 tree model.
In both the meanDecreaseAccuracy and meanDecreaseGini categories, speechiness is far and away the most important variable in identifying explicit songs, followed by danceability, popularity, and decade. Key is also important for meanDecreaseGini, which further supports the initial hypothesis made at the beginning.
248 Forest Error Visualization
Confusion Matrices
A confusion matrix for the 500 trees is displayed first, then one for 248 trees.
## no explicit explicit class.error
## no explicit 79733 647 0.008049266
## explicit 7927 2774 0.740771890
## no explicit explicit class.error
## no explicit 79611 769 0.009567056
## explicit 7615 3086 0.711615737
Predictions on Test Data
248 Trees
First we use the predict function in order to create a confusion matrix.
## Confusion Matrix and Statistics
##
## Actual
## Prediction no explicit explicit
## no explicit 8832 847
## explicit 90 351
##
## Accuracy : 0.9074
## 95% CI : (0.9016, 0.913)
## No Information Rate : 0.8816
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3894
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.29299
## Specificity : 0.98991
## Pos Pred Value : 0.79592
## Neg Pred Value : 0.91249
## Precision : 0.79592
## Recall : 0.29299
## F1 : 0.42831
## Prevalence : 0.11838
## Detection Rate : 0.03468
## Detection Prevalence : 0.04358
## Balanced Accuracy : 0.64145
##
## 'Positive' Class : explicit
##
Tuning Model
Using the tuneRf function, we are now checking for the optimal number of variables to use/test during the tree building process.
## mtry OOBError
## 3.OOB 3 0.07141994
## 5.OOB 5 0.07163953
## 10.OOB 10 0.07140897
The tuneRF results show that 10 is probably the best mtry value to use, though not by a very large margin.
Conclusions
Through this process, we have identified several important quantifiers for classifying explicit songs without doing any lyric analysis. We can confidently conclude that speechiness is the most important variable in identifying explicit music, as the importance metrics of the random forests show. In addition, danceability, key, and popularity are also important variables to consider.
What Makes a Song Possess High or Low Valence?
The final part of the dataset that we will consider consists of the factors that qualify a song as possessing high or low valence.
Exploratory Analysis
Correlation between Quantitative Variables
| valence | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| valence | 1.0000 | -0.2102 | 0.5198 | -0.1503 | 0.3317 | -0.2318 | -0.0252 | 0.2493 | 0.0272 | 0.1061 | -0.1597 |
| acousticness | -0.2102 | 1.0000 | -0.1930 | -0.0508 | -0.7079 | 0.1999 | -0.0677 | -0.5568 | -0.1070 | -0.1585 | -0.1467 |
| danceability | 0.5198 | -0.1930 | 1.0000 | -0.1113 | 0.1297 | -0.2724 | -0.1306 | 0.2538 | 0.1953 | -0.1098 | 0.1677 |
| duration_ms | -0.1503 | -0.0508 | -0.1113 | 1.0000 | 0.0000 | 0.1153 | 0.0540 | -0.0597 | -0.0381 | -0.0353 | -0.1084 |
| energy | 0.3317 | -0.7079 | 0.1297 | 0.0000 | 1.0000 | -0.1996 | 0.1756 | 0.7518 | 0.1507 | 0.2051 | 0.1447 |
| instrumentalness | -0.2318 | 0.1999 | -0.2724 | 0.1153 | -0.1996 | 1.0000 | -0.0275 | -0.3964 | -0.1049 | -0.0673 | -0.0681 |
| liveness | -0.0252 | -0.0677 | -0.1306 | 0.0540 | 0.1756 | -0.0275 | 1.0000 | 0.0743 | 0.1396 | 0.0160 | -0.0511 |
| loudness | 0.2493 | -0.5568 | 0.2538 | -0.0597 | 0.7518 | -0.3964 | 0.0743 | 1.0000 | 0.1190 | 0.1669 | 0.3386 |
| speechiness | 0.0272 | -0.1070 | 0.1953 | -0.0381 | 0.1507 | -0.1049 | 0.1396 | 0.1190 | 1.0000 | 0.0346 | 0.1844 |
| tempo | 0.1061 | -0.1585 | -0.1098 | -0.0353 | 0.2051 | -0.0673 | 0.0160 | 0.1669 | 0.0346 | 1.0000 | 0.0135 |
| year | -0.1597 | -0.1467 | 0.1677 | -0.1084 | 0.1447 | -0.0681 | -0.0511 | 0.3386 | 0.1844 | 0.0135 | 1.0000 |
From this table we will remove acousticness because of its extremely strong negative correlation with energy (-0.71): two highly correlated predictors carry largely redundant information.
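The entries in the table above are Pearson correlation coefficients. For reference, the statistic can be written in a few lines of pure Python (a sketch; the report computed the table in R):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linearly related series correlate at exactly 1.0
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 1.0
```

Values near +1 or -1 (like the -0.71 between acousticness and energy) indicate a strong linear relationship; values near 0 (like liveness vs. valence at -0.03) indicate almost none.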
Summary Statistics
The dataset will be divided into three categories: songs with a valence above 0.6 will be considered “happy/cheerful”, 0.4 to 0.6 “neutral”, and below 0.4 “sad/depressed.” The counts for each decade are displayed below:
| decade | valence_fact | count |
|---|---|---|
| 1970s | happy/cheerful | 10128 |
| 1970s | neutral | 6616 |
| 1970s | sad/depressed | 3256 |
| 1980s | happy/cheerful | 9654 |
| 1980s | neutral | 6177 |
| 1980s | sad/depressed | 4019 |
| 1990s | happy/cheerful | 9001 |
| 1990s | neutral | 6540 |
| 1990s | sad/depressed | 4360 |
| 2000s | happy/cheerful | 8301 |
| 2000s | neutral | 6959 |
| 2000s | sad/depressed | 4386 |
| 2010s | happy/cheerful | 5737 |
| 2010s | neutral | 8140 |
| 2010s | sad/depressed | 5897 |
| 2020 | happy/cheerful | 690 |
| 2020 | neutral | 920 |
| 2020 | sad/depressed | 420 |
The various base rates are as follows:
Happy = 42.99%
Neutral = 34.93%
Sad = 22.07%
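The bucketing rule and the base rates above can be reproduced directly from the decade table (sketched here in Python rather than the R used for the report; we assume scores exactly on a boundary fall in the neutral bucket, which the thresholds above leave ambiguous):

```python
def valence_category(valence):
    """Bucket a valence score using the report's thresholds."""
    if valence > 0.6:
        return "happy/cheerful"
    if valence >= 0.4:          # boundary handling is an assumption
        return "neutral"
    return "sad/depressed"

# Totals summed from the per-decade counts in the table above
totals = {"happy/cheerful": 43511, "neutral": 35352, "sad/depressed": 22338}
n = sum(totals.values())        # 101,201 songs across 1970-2020

for cat, count in totals.items():
    print(f"{cat}: {100 * count / n:.2f}%")
# happy/cheerful: 42.99%
# neutral: 34.93%
# sad/depressed: 22.07%
```

These base rates matter for evaluating the classifier later: always predicting “happy/cheerful” would already be right about 43% of the time.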
Data Visualizations
As we would initially expect, happy songs tend to have the highest energy levels, followed by neutral and then sad songs. The tempos of all three categories tend to hover between 100 and 125 BPM, and popularity between 37 and 50, with no clear statistically significant differences. The key distributions look nearly identical across the three categories. Finally, the bar graph shows a rise in “neutral” and “sad” songs in recent decades, while the number of happy songs has gradually decreased over time.
Selecting Important Variables
Instead of a random forest, we will use a k-nearest neighbors (kNN) model, which predicts the valence category of a song from the categories of its k nearest neighbors in feature space.
kNN
Training/Testing Data
For our algorithm, we have decided to do a 90/10 split of the data for training and testing.
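A 90/10 split like the one used here can be sketched as follows (a Python illustration with a placeholder dataset; the report's split was done in R):

```python
import random

def train_test_split(rows, test_frac=0.10, seed=42):
    """Shuffle and split rows into (train, test) with the given test fraction."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))                # placeholder for the real song rows
train, test = train_test_split(data)
print(len(train), len(test))  # 90 10
```

Shuffling before splitting matters here because the dataset is ordered by year; a non-random split would train on old songs and test on new ones.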
3NN
The first section of our model will be a 3-nearest neighbors model, using the above 90/10 train/test split.
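The core idea of a 3-nearest neighbors classifier, assigning a point the majority class among its 3 closest training points, can be sketched in Python (an illustration with made-up 2-D points; the report's model was built in R on the full feature set):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(x, point), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in dists[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny illustration: two clusters in a 2-D feature space
X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
y = ["sad/depressed"] * 3 + ["happy/cheerful"] * 3
print(knn_predict(X, y, (0.12, 0.18), k=3))  # sad/depressed
```

Because kNN relies on distances, features on very different scales (e.g. duration_ms vs. danceability) should be normalized before fitting, or the largest-scale feature will dominate.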
Evaluating Model
Confusion Matrix and Accuracy
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 10120
##
##
## | valence_3NN
## valence_test$valence_fact | happy/cheerful | neutral | sad/depressed | Row Total |
## --------------------------|----------------|----------------|----------------|----------------|
## happy/cheerful | 2971 | 1100 | 283 | 4354 |
## | 0.682 | 0.253 | 0.065 | 0.430 |
## | 0.641 | 0.328 | 0.133 | |
## | 0.294 | 0.109 | 0.028 | |
## --------------------------|----------------|----------------|----------------|----------------|
## neutral | 1323 | 1471 | 752 | 3546 |
## | 0.373 | 0.415 | 0.212 | 0.350 |
## | 0.285 | 0.439 | 0.353 | |
## | 0.131 | 0.145 | 0.074 | |
## --------------------------|----------------|----------------|----------------|----------------|
## sad/depressed | 343 | 781 | 1096 | 2220 |
## | 0.155 | 0.352 | 0.494 | 0.219 |
## | 0.074 | 0.233 | 0.514 | |
## | 0.034 | 0.077 | 0.108 | |
## --------------------------|----------------|----------------|----------------|----------------|
## Column Total | 4637 | 3352 | 2131 | 10120 |
## | 0.458 | 0.331 | 0.211 | |
## --------------------------|----------------|----------------|----------------|----------------|
##
##
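The overall accuracy implied by this crosstab is simply the diagonal (correct predictions) over the total. Recomputing it from the counts above (sketched in Python; the report itself uses R):

```python
# Rows = actual class, columns = predicted class, copied from the 3-NN crosstab
# Order: happy/cheerful, neutral, sad/depressed
conf = [
    [2971, 1100,  283],
    [1323, 1471,  752],
    [ 343,  781, 1096],
]

total = sum(sum(row) for row in conf)          # 10,120 test songs
correct = sum(conf[i][i] for i in range(3))    # diagonal = correct predictions
print(f"accuracy = {correct / total:.4f}")  # accuracy = 0.5472
```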
Our 3-NN model achieved an overall accuracy of 54.72%, which isn't very good given the size of the dataset. Let's see if we can find a k-value that increases the overall accuracy of our predictions:
Choosing Optimal K
Looking at the dataframe above, the overall accuracy of the model gradually increases with higher choices of k, reaching roughly 59.6% for a 21-NN model. However, the gains begin to flatten out around six nearest neighbors, so that is what we will pick for our optimized model.
Elbow Plot:
The elbow plot above confirms that the marginal gain in accuracy levels off at around k = 6, so we'll now build a model with six nearest neighbors.
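The elbow choice can be made explicit: scan k in increasing order and stop once the accuracy gain per unit k falls below a small threshold. A sketch in Python, using only the accuracies quoted in this report (54.72% at k = 3, 57.12% at k = 6, roughly 59.6% at k = 21; the intermediate values and the 0.005-per-step threshold are assumptions for illustration):

```python
def elbow_k(acc_by_k, min_gain_per_step=0.005):
    """Pick the smallest k after which per-step accuracy gains fall below a threshold."""
    ks = sorted(acc_by_k)
    for prev, nxt in zip(ks, ks[1:]):
        gain_per_step = (acc_by_k[nxt] - acc_by_k[prev]) / (nxt - prev)
        if gain_per_step < min_gain_per_step:
            return prev            # gains flattened after `prev`
    return ks[-1]

# Accuracies quoted in the report
acc = {3: 0.5472, 6: 0.5712, 21: 0.596}
print(elbow_k(acc))  # 6
```

This formalizes the trade-off in the text: k = 21 is slightly more accurate, but the extra neighbors buy very little per step compared to the jump from 3 to 6.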
6NN
Evaluating Model
Confusion Matrix and Accuracy
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 10120
##
##
## | valence_6NN
## valence_test$valence_fact | happy/cheerful | neutral | sad/depressed | Row Total |
## --------------------------|----------------|----------------|----------------|----------------|
## happy/cheerful | 3086 | 1080 | 188 | 4354 |
## | 0.709 | 0.248 | 0.043 | 0.430 |
## | 0.661 | 0.309 | 0.096 | |
## | 0.305 | 0.107 | 0.019 | |
## --------------------------|----------------|----------------|----------------|----------------|
## neutral | 1286 | 1593 | 667 | 3546 |
## | 0.363 | 0.449 | 0.188 | 0.350 |
## | 0.275 | 0.456 | 0.341 | |
## | 0.127 | 0.157 | 0.066 | |
## --------------------------|----------------|----------------|----------------|----------------|
## sad/depressed | 300 | 818 | 1102 | 2220 |
## | 0.135 | 0.368 | 0.496 | 0.219 |
## | 0.064 | 0.234 | 0.563 | |
## | 0.030 | 0.081 | 0.109 | |
## --------------------------|----------------|----------------|----------------|----------------|
## Column Total | 4672 | 3491 | 1957 | 10120 |
## | 0.462 | 0.345 | 0.193 | |
## --------------------------|----------------|----------------|----------------|----------------|
##
##
The overall accuracy of the model increased slightly to 57.12%, indicating that a 6-NN model is preferable to our 3-NN one. The 6-NN model was also marginally better at predicting the correct outcome in each category. Further testing might be needed to optimize this model for even greater accuracy.
Conclusions
The kNN model may not be the best method for determining song valence. Compared to the first two random forests (which had overall accuracy rates in the low 90s), this model performed only modestly better than always guessing the majority class (happy/cheerful, with a base rate of about 43%). The algorithm correctly classified the valence of a song only 57.12% of the time, which is not accurate enough for use in an analytics setting.
Future Work
In the future, the random forest models could be applied to individual users' listening data to determine their music taste and recommend songs based on their likes and dislikes. Our explicit-content random forest could become part of a parental-controls filter that prevents children from discovering and listening to explicit music on certain platforms. In addition, artists, music engineers, and record labels could use our popularity xgboost model to find the best formulas for creating hit songs, maximizing profit and listener count. However, one must keep in mind that musical popularity is inherently subjective and there is no set formula for a hit song, especially with rapidly changing trends spurred on by social media apps such as TikTok. Nonetheless, the popularity models could be re-run periodically to track those changing trends and capitalize on them.